Acknowledgement

We would like to extend our heartfelt thanks to Professor Dootika Vats for her invaluable guidance and support throughout the development of our Book Data Analysis project. Her expert feedback, encouragement, and dedication have played a crucial role in the success of this project. We are deeply grateful for her patience and commitment in helping us overcome the challenges we encountered along the way.

1. Introduction

This report provides an analysis of the top 500 ranked books on Goodreads. Our aim is to uncover patterns in reader preferences, genre popularity, author influence, and other trends based on a comprehensive dataset we obtained through web scraping.

2. About the Data

The data for this analysis was scraped from Goodreads and includes the following features:
- Rank: Book ranking based on popularity and rating.
- Title: Book title.
- Author Name: Author(s) of the book.
- Average Rating: Average user rating of the book.
- Number of Ratings (Rater): Total ratings given to the book.
- Score: Goodreads score based on votes and ratings.
- Voters: Number of users who voted on the book’s score.
- First Published Date: Initial publication date of the book.
- Price: Price of the book.
- Author Followers: Count of followers of the author on Goodreads.
- Number of Reviews: Total reviews for the book.
- Top Genres: Top five genres associated with the book.
- Rating Distribution: Distribution of 1- to 5-star ratings.
- Cover Type: Format of the book cover (e.g., paperback, ebook).
- Author Average Rating: Average rating across all books by the author.
- Medium of Publication: Type of publication medium for each book listed.

3. Obtaining the Data

4. Assumptions

  1. Ratings Assumption: We assume that people who rate a book have read it, though this may not always be true.
  2. Rater-Buyer Relationship: We assume readers who rate the book have purchased it legally.

5. Possible biases in the data

  1. We assume that every reader who rates a book has purchased it, either physically or digitally. This overlooks the possibility of pirated versions, so the number of raters may not be directly proportional to the number of buyers.

  2. We assume that raters have purchased and read the book, making ratings proportional to book sales. However, some people may rate a book without reading it, potentially influenced by others, which could skew the accuracy of the ratings.

  3. Our dataset is biased toward popular, top-ranked books on Goodreads, excluding lower-ranked titles. To achieve a more balanced dataset, we could include books with a wider range of rankings, including less popular ones.

6. Interesting Questions

  1. What are the basic statistics of the data?
  2. Who are the top authors on the basis of book count and rating?
  3. What is the distribution of ratings?
  4. Which genres are most prevalent in the top 500 ranked books?
  5. Which authors have the most followers?
  6. How does an author’s popularity (i.e., their ratings) influence the ratings of their books?
  7. Which genres are most commonly associated with top authors?
  8. Do books with more pages tend to receive more or fewer ratings from readers?
  9. Do books with more pages tend to have higher prices?
  10. Do fewer people read higher-priced books? Do readers rate higher-priced books lower?
  11. Are there genres that have become more popular in recent years?
  12. Which genres have the highest average rating?
  13. Do hardcover books sell better, and do they receive a higher average rating?
  14. What is the relationship between publication year and the number of raters (that is, the number of people reading the book)?

7. Name of Columns

##  [1] "Rank"                  "Title"                 "authorName"           
##  [4] "avg_rating"            "rater"                 "score"                
##  [7] "voter"                 "price"                 "First_published"      
## [10] "pages"                 "reviews"               "followers"            
## [13] "top_genre"             "second_genre"          "third_genre"          
## [16] "fourth_genre"          "fifth_genre"           "Five_stars"           
## [19] "Four_stars"            "Three_stars"           "Two_stars"            
## [22] "One_stars"             "cover_type"            "author_avg_rating"    
## [25] "medium_of_publication"

8. Basic Statistics and Missing Data

##       Rank          Title            authorName          avg_rating   
##  Min.   :  1.0   Length:500         Length:500         Min.   :3.330  
##  1st Qu.:125.8   Class :character   Class :character   1st Qu.:3.947  
##  Median :250.5   Mode  :character   Mode  :character   Median :4.110  
##  Mean   :250.5                                         Mean   :4.108  
##  3rd Qu.:375.2                                         3rd Qu.:4.280  
##  Max.   :500.0                                         Max.   :4.810  
##                                                                       
##      rater              score             voter           price       
##  Min.   :    1022   Min.   :  23761   Min.   :  252   Min.   : 0.000  
##  1st Qu.:  301825   1st Qu.:  36676   1st Qu.:  464   1st Qu.: 1.990  
##  Median :  543634   Median :  64096   Median :  783   Median : 8.235  
##  Mean   :  934446   Mean   : 210582   Mean   : 2305   Mean   : 7.671  
##  3rd Qu.: 1016434   3rd Qu.: 158691   3rd Qu.: 1805   3rd Qu.:12.592  
##  Max.   :10414224   Max.   :3913869   Max.   :39812   Max.   :74.990  
##                                                       NA's   :200     
##  First_published        pages           reviews         followers     
##  Length:500         Min.   :  26.0   Min.   :    52   Min.   :     1  
##  Class :character   1st Qu.: 254.5   1st Qu.: 11642   1st Qu.:  5362  
##  Mode  :character   Median : 370.0   Median : 22048   Median : 20700  
##                     Mean   : 436.6   Mean   : 36826   Mean   : 81573  
##                     3rd Qu.: 509.5   3rd Qu.: 44941   3rd Qu.: 68700  
##                     Max.   :4100.0   Max.   :295811   Max.   :857000  
##                     NA's   :1                                         
##   top_genre         second_genre       third_genre        fourth_genre      
##  Length:500         Length:500         Length:500         Length:500        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  fifth_genre          Five_stars      Four_stars     Three_stars  
##  Length:500         Min.   :19.00   Min.   : 3.00   Min.   : 2.0  
##  Class :character   1st Qu.:36.00   1st Qu.:29.00   1st Qu.:12.0  
##  Mode  :character   Median :42.00   Median :32.00   Median :16.0  
##                     Mean   :43.86   Mean   :31.11   Mean   :16.3  
##                     3rd Qu.:51.00   3rd Qu.:34.00   3rd Qu.:20.0  
##                     Max.   :90.00   Max.   :42.00   Max.   :33.0  
##                                                                   
##    Two_stars      One_stars     cover_type        author_avg_rating
##  Min.   :0.00   Min.   :0.00   Length:500         Min.   :0.000    
##  1st Qu.:2.00   1st Qu.:1.00   Class :character   1st Qu.:3.920    
##  Median :4.00   Median :1.00   Mode  :character   Median :4.070    
##  Mean   :3.96   Mean   :1.88                      Mean   :4.059    
##  3rd Qu.:5.00   3rd Qu.:2.00                      3rd Qu.:4.210    
##  Max.   :9.00   Max.   :9.00                      Max.   :4.810    
##                                                                    
##  medium_of_publication publishtion_year
##  Length:500            Min.   : 401    
##  Class :character      1st Qu.:1951    
##  Mode  :character      Median :1989    
##                        Mean   :1948    
##                        3rd Qu.:2006    
##                        Max.   :2023    
## 
##                  Rank                 Title            authorName 
##                     0                     0                     0 
##            avg_rating                 rater                 score 
##                     0                     0                     0 
##                 voter                 price       First_published 
##                     0                   200                     0 
##                 pages               reviews             followers 
##                     1                     0                     0 
##             top_genre          second_genre           third_genre 
##                     0                     1                     1 
##          fourth_genre           fifth_genre            Five_stars 
##                     1                     1                     0 
##            Four_stars           Three_stars             Two_stars 
##                     0                     0                     0 
##             One_stars            cover_type     author_avg_rating 
##                     0                     0                     0 
## medium_of_publication      publishtion_year 
##                     0                     0

Conclusion:- Based on the summary statistics and the missing values analysis for the group_6 dataset, here are the key insights and conclusions:
Summary Statistics Overview:
- The dataset ranks 500 top books with ranks ranging from 1 to 500, and an average rating between 3.33 and 4.81, with a median rating of 4.11. This suggests a generally high quality of books.
- The rater, score, and voter columns span wide ranges: the maximum is about 10.4 million for rater and 3.9 million for score, while voter peaks at 39,812. This indicates that a few books are far more popular than the rest.
- The price feature has a large range (from 0 to 74.99), suggesting varying cost of books. The median price is 8.235, but there are missing values in this column (200 missing entries).
- Page count varies significantly, from a minimum of 26 to a maximum of 4,100. The maximum is this high because some ranked entries are sets of several books. The median of 370 suggests that most books are moderately long.
- Review counts and follower counts have high maximums (295,811 reviews and 857,000 followers), indicating the presence of very popular authors/books in the dataset.
Missing Values Analysis:
- The price column has a substantial number of missing values (200 out of 500), which might need handling to avoid issues in subsequent analyses.
- The pages column has 1 missing value, while genre columns (second_genre, third_genre, fourth_genre, fifth_genre) also have 1 missing value each, which may slightly impact genre-based analysis.
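- Other features have no missing entries, indicating good data quality overall. The author_avg_rating minimum of 0.000 may still deserve a check, since a zero could be a placeholder rather than a true rating.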
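
One way to handle the missing prices is to restrict price-based analyses to rows with a known price and to check how much the remaining subset differs from the full data. The sketch below uses a small toy data frame named `books` as a stand-in for the scraped data; the values and the object name are assumptions for illustration, not the project's actual data.

```r
# Toy stand-in for the scraped data (hypothetical values; the real data has 500 rows).
books <- data.frame(
  Title = c("A", "B", "C", "D", "E"),
  price = c(8.99, NA, 0, 12.5, NA),
  rater = c(1200, 560000, 98000, 301000, 45000),
  avg_rating = c(4.1, 3.9, 4.3, 4.0, 4.5)
)

# Count missing values per column.
missing_counts <- colSums(is.na(books))
print(missing_counts)

# Keep only rows with a known price for price-based plots,
# and record how many rows were dropped.
priced <- books[!is.na(books$price), ]
n_dropped <- nrow(books) - nrow(priced)

# Compare the priced subset with the full data on another variable,
# to see whether dropping rows shifts the picture.
rating_gap <- mean(priced$avg_rating) - mean(books$avg_rating)
print(rating_gap)
```

Dropping rows rather than imputing keeps the price plots honest about what was observed, at the cost of a smaller sample.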

9. Top Authors

## [1] "Top 10 authors based on the number of books available in the dataset"
## # A tibble: 10 × 2
##    authorName               n
##    <chr>                <int>
##  1 Stephen         King    11
##  2 J.K. Rowling            10
##  3 Rick Riordan             9
##  4 Cassandra Clare          7
##  5 Richelle Mead            6
##  6 Sarah J. Maas            6
##  7 William Shakespeare      6
##  8 C.S. Lewis               5
##  9 Dr. Seuss                5
## 10 J.R.R. Tolkien           5
## [1] "Top 10 authors based on their avg_rating"
## # A tibble: 10 × 2
##    authorName        avg_rating
##    <chr>                  <dbl>
##  1 Bill Watterson          4.81
##  2 Jerry Weaver            4.81
##  3 Art Spiegelman          4.58
##  4 Brandon Sanderson       4.57
##  5 Rebecca Yarros          4.57
##  6 Sergio Cobo             4.57
##  7 Larry McMurtry          4.54
##  8 Leigh Bardugo           4.54
##  9 Patrick Rothfuss        4.54
## 10 Francine Rivers         4.51

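Conclusion: Stephen King (11 books) and J.K. Rowling (10 books) have the most titles in the top 500, while Bill Watterson and Jerry Weaver have the highest avg_rating (4.81). No author appears in both lists, so having many ranked books and having the highest-rated books are distinct measures of an author's standing.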

10. Visualisations

Rating Distribution

Conclusion: The density plot shows that most books have a high average rating, with a peak around 4 stars.

Genre Distribution

Conclusion: Fiction, Fantasy, and Classics are the most common genres among top-rated books.

Countplot of Author’s Followers

## # A tibble: 10 × 2
##    authorName           total_followers
##    <chr>                          <int>
##  1 Stephen         King         9427000
##  2 Rick Riordan                 3906000
##  3 Sarah J. Maas                2748000
##  4 J.K. Rowling                 2280000
##  5 Cassandra Clare              1911000
##  6 Neil Gaiman                  1585000
##  7 Colleen Hoover               1466000
##  8 Veronica Roth                1383000
##  9 Nicholas Sparks               928000
## 10 John Green                    924000
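
The totals above look like the per-author follower count added up once for every book by that author: for example, 9,427,000 equals 11 × 857,000, where 857,000 is the maximum of the followers column and 11 is Stephen King's book count. If followers is indeed a per-author figure repeated on each row, taking one value per author avoids weighting authors by their number of ranked books. A minimal sketch, with a toy `books` data frame and placeholder author names that are assumptions for illustration:

```r
# Toy data: the follower count repeats on every row for the same author.
books <- data.frame(
  authorName = c("Author A", "Author A", "Author A",
                 "Author B", "Author C", "Author C"),
  followers  = c(857000, 857000, 857000, 924000, 434000, 434000)
)

# Summing repeats the same per-author figure once per book.
summed <- tapply(books$followers, books$authorName, sum)

# Taking the maximum keeps a single follower count per author.
per_author <- tapply(books$followers, books$authorName, max)

# Rank authors by the de-duplicated count.
ranking <- sort(per_author, decreasing = TRUE)
print(ranking)
```

In this toy example Author A leads under the summed measure only because of having three rows, while Author B leads once each author is counted once.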

Author’s Influence on Book Rating

## `geom_smooth()` using formula = 'y ~ x'

Conclusion: There is a positive correlation, suggesting that books by highly rated authors are more likely to receive high ratings. This is consistent with highly rated authors producing well-received books, although an author's average rating is partly determined by the ratings of these same books, so the two variables are not independent.

Author and their specific genres

## `summarise()` has grouped output by 'authorName'. You can override using the
## `.groups` argument.

Conclusion: This heatmap shows a strong association of top authors with some specific genres: for example, J.K. Rowling and Rick Riordan with Fantasy and Nicholas Sparks with Romance. Some authors, like Neil Gaiman and Haruki Murakami, span multiple genres, indicating versatility. Fantasy, Classics, and Fiction are the most popular genres among these authors.

Relation between book length and ratings: Do longer books receive higher or lower ratings?

## `geom_smooth()` using formula = 'y ~ x'

Conclusion:- This plot shows a moderate positive correlation (0.32) between the number of pages in a book and its average rating, suggesting that, to some extent, readers tend to favor longer books over shorter ones.
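
A correlation like the one reported above can be computed as sketched below. Because the pages column has one missing value and an extreme maximum (4,100 pages), the sketch skips incomplete rows and also computes the rank-based Spearman coefficient, which is less sensitive to outliers. The toy `books` data frame and its values are assumptions for illustration, so the resulting numbers do not reproduce the 0.32 from the report.

```r
# Toy data with one missing page count and one very long entry (hypothetical values).
books <- data.frame(
  pages = c(120, 250, 310, 370, 480, 640, NA, 4100),
  avg_rating = c(3.8, 3.9, 4.1, 4.0, 4.2, 4.3, 4.1, 4.4)
)

# Pearson correlation, using only rows where both values are present.
pearson <- cor(books$pages, books$avg_rating, use = "complete.obs")

# Spearman correlation works on ranks, so the 4100-page entry
# counts only as "the longest" rather than by its magnitude.
spearman <- cor(books$pages, books$avg_rating,
                use = "complete.obs", method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

If the two coefficients differ noticeably on the real data, the linear correlation is probably being driven by a few unusually long entries.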

Relationship Between Book Length and the Number of Raters

## `geom_smooth()` using formula = 'y ~ x'

Conclusion:- We can see a very weak negative correlation between the number of pages and the number of raters (readers); the number of readers decreases only when book length increases substantially.

Relationship between book length and price of the book

## `geom_smooth()` using formula = 'y ~ x'

Conclusion:- Although we might have expected the price of the book to increase with the number of pages, the data shows that this is not the case.

Relationship between price and number of readers

Conclusion:- One might have expected lower-priced books to have more ratings because they are more accessible, and more expensive books to have fewer. However, the number of ratings is almost uncorrelated with price, which may indicate that readers are willing to pay even for an expensive book when it is among these top-ranked titles.

Highest rated genres

Conclusion:- The genres Graphic Novels, Picture Books, and Children's Books have the highest average ratings. One possible explanation is that children rate books more generously than adults, or that adult readers assess books more critically, though the data contains no information about raters' ages to confirm this.
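
Since each book carries up to five genre columns, a per-genre average can be computed by stacking those columns into one long table. A genre with only one or two books can reach a very high average by chance, so the sketch below also counts books per genre and keeps only genres above a threshold. The toy `books` data frame, the two genre columns shown, and the `min_books` threshold are assumptions for illustration.

```r
# Toy data with two of the five genre columns (hypothetical values).
books <- data.frame(
  avg_rating   = c(4.8, 4.2, 4.0, 4.4),
  top_genre    = c("Comics", "Fantasy", "Classics", "Fantasy"),
  second_genre = c("Humor", "Fiction", "Fiction", NA),
  stringsAsFactors = FALSE
)

genre_cols <- c("top_genre", "second_genre")

# Stack the genre columns so each (book, genre) pair is one row.
long <- data.frame(
  genre = unlist(books[genre_cols], use.names = FALSE),
  avg_rating = rep(books$avg_rating, times = length(genre_cols))
)
long <- long[!is.na(long$genre), ]

# Mean rating and number of books per genre.
genre_mean <- tapply(long$avg_rating, long$genre, mean)
genre_n <- tapply(long$avg_rating, long$genre, length)

# Keep only genres backed by at least min_books books.
min_books <- 2
reliable <- sort(genre_mean[genre_n >= min_books], decreasing = TRUE)
print(reliable)
```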

Medium of publication

Conclusion:- For a given price, the number of raters (the people reading the book after purchasing it) does not appear to depend on whether the medium of publication is digital or physical.

Conclusion:- One might have expected physical books to be rated higher than digital books at a given price, since reading a physical book has a certain feel to it. However, this does not seem to be true: for a given price, the medium of publication has little effect on the average rating.

Publication Year vs number of readers

Conclusion:- Books published more recently, that is after 2000, have a higher number of ratings, so we can assume they have been read much more in recent years than older books.

11. Some Additional Visualisations

Cover_type count plot

Conclusion:- The majority of books in the dataset have a paperback cover type. This suggests that paperback editions are more prevalent, likely due to their affordability and wider availability compared to hardcover or digital formats.

Average Rating vs Price by Genre

Conclusion:- We can observe that the price does not influence the average rating received by the book. Therefore, our assumption that a higher price might lead to a lower average rating was incorrect across all genres.

Box-plot of Average Rating by Top 5 Genres

This plot shows the box plot of the average rating based on top 5 genres.

12. Overall Conclusion

The analysis highlights several trends in book popularity, genre influence, author reputation, and the relationship between book characteristics like page count and ratings. While certain genres dominate the top ranks, the quality of authors and book content also plays a significant role in reader reception.

13. Challenges Faced

We frequently faced timeout issues caused by the complexity of the data structure, with execution times exceeding 30 minutes. Inefficient R loops compounded the problem, as they could not process the data quickly enough, so the system timed out before completing the task. Despite several optimization attempts, the intricate data handling and slow loop execution continued to cause timeout errors, significantly affecting the performance and efficiency of our code.
Another challenge was the missing values in the ‘price’ column for 200 of the 500 books. This inconsistency may have affected the accuracy of visualizations involving price-based analysis.
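
If part of the slowness came from element-by-element R loops, replacing them with vectorised calls is a common remedy. The sketch below contrasts a loop that grows its result one element at a time with an equivalent vectorised version. The input strings, their format, and the function names `parse_loop` and `parse_vec` are hypothetical; they are not taken from the project's scraping code.

```r
# Hypothetical raw strings, as they might come from a scraped page.
raw_raters <- c("1,022 ratings", "543,634 ratings", "10,414,224 ratings")

# Loop version: grows the result one element at a time,
# which copies the whole vector on every iteration.
parse_loop <- function(x) {
  out <- c()
  for (s in x) {
    out <- c(out, as.numeric(gsub("[^0-9]", "", s)))
  }
  out
}

# Vectorised version: gsub() and as.numeric() operate on the whole vector at once.
parse_vec <- function(x) as.numeric(gsub("[^0-9]", "", x))

# Both versions give the same result.
stopifnot(identical(parse_loop(raw_raters), parse_vec(raw_raters)))
print(parse_vec(raw_raters))
```

The vectorised form avoids repeated copying, which matters more as the input grows. It would not help if the time was mostly spent waiting on network requests.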